Classification Rules Reference
Overview
Classification rules hold the detection logic that identifies sensitive data patterns. Each rule combines policy elements, matching elements, and confidence thresholds to describe what content to detect and how certain a detection is.
Rule Structure
Basic Rule Structure
<Entity id="rule-identifier" patternsProximity="300" recommendedConfidence="75">
  <Pattern confidenceLevel="85">
    <IdMatch idRef="pattern-reference"/>
    <Match idRef="supporting-evidence"/>
  </Pattern>
  <Pattern confidenceLevel="65">
    <IdMatch idRef="alternative-pattern"/>
    <Any minMatches="2">
      <Match idRef="keyword-group-1"/>
      <Match idRef="keyword-group-2"/>
      <Match idRef="keyword-group-3"/>
    </Any>
  </Pattern>
</Entity>
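To inspect a rule like the one above programmatically, the following is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It assumes the rule is available as a standalone XML fragment without namespaces; the helper name `summarize_rule` is hypothetical and not part of any rule engine.

```python
import xml.etree.ElementTree as ET

RULE_XML = """
<Entity id="rule-identifier" patternsProximity="300" recommendedConfidence="75">
  <Pattern confidenceLevel="85">
    <IdMatch idRef="pattern-reference"/>
    <Match idRef="supporting-evidence"/>
  </Pattern>
  <Pattern confidenceLevel="65">
    <IdMatch idRef="alternative-pattern"/>
    <Any minMatches="2">
      <Match idRef="keyword-group-1"/>
      <Match idRef="keyword-group-2"/>
      <Match idRef="keyword-group-3"/>
    </Any>
  </Pattern>
</Entity>
"""

def summarize_rule(xml_text):
    """Return the rule id and a list of (confidenceLevel, primary idRef) pairs."""
    rule = ET.fromstring(xml_text)
    patterns = []
    for pattern in rule.findall("Pattern"):
        id_match = pattern.find("IdMatch")  # the primary pattern of this Pattern
        patterns.append((int(pattern.get("confidenceLevel")), id_match.get("idRef")))
    return rule.get("id"), patterns

rule_id, patterns = summarize_rule(RULE_XML)
print(rule_id, patterns)
```

A summary like this can help when reviewing which primary pattern backs each confidence level in a rule.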
Rule Types
Entity Rules
Entity rules detect specific data types with high-confidence patterns.
| Attribute | Type | Description | Required | Default | Range |
|---|---|---|---|---|---|
| id | String | Unique rule identifier | Yes | - | Must be unique within rule pack |
| patternsProximity | Integer | Maximum distance between patterns | No | 300 | 1-1000 characters |
| recommendedConfidence | Integer | Recommended confidence threshold | No | 75 | 1-100 |
Example:
<Entity id="credit-card-number" patternsProximity="300" recommendedConfidence="85">
  <!-- Pattern definitions -->
</Entity>
Evidence Rules
Evidence rules look for supporting context that increases confidence in detections.
<Evidence id="financial-context" patternsProximity="150">
  <Pattern confidenceLevel="40">
    <IdMatch idRef="financial-keywords"/>
  </Pattern>
</Evidence>
Proximity Rules
Proximity rules check for related content within a specified distance.
<Proximity id="payment-context" patternsProximity="200">
  <Pattern confidenceLevel="30">
    <IdMatch idRef="payment-terms"/>
    <Match idRef="amount-patterns"/>
  </Pattern>
</Proximity>
Affinity Rules
Affinity rules detect relationships between different data elements.
<Affinity id="personal-financial" patternsProximity="500">
  <Pattern confidenceLevel="60">
    <IdMatch idRef="ssn-pattern"/>
    <Match idRef="credit-card-pattern"/>
  </Pattern>
</Affinity>
Similarity Rules
Similarity rules find content that matches known sensitive patterns.
<Similarity id="document-similarity" recommendedConfidence="70">
  <Pattern confidenceLevel="80">
    <IdMatch idRef="document-fingerprint"/>
  </Pattern>
</Similarity>
Pattern Elements
Pattern Structure
Patterns define the specific conditions that must be met for a rule to match.
| Attribute | Type | Description | Required | Range |
|---|---|---|---|---|
| confidenceLevel | Integer | Confidence level for this pattern | Yes | 1-100 |
IdMatch Element
The IdMatch element specifies the primary pattern that must be found.
| Attribute | Type | Description | Required | Example |
|---|---|---|---|---|
| idRef | String | Reference to a pattern or keyword resource | Yes | "credit-card-regex" |
Example:
<IdMatch idRef="ssn-pattern"/>
Match Element
Match elements specify supporting evidence or additional patterns.
| Attribute | Type | Description | Required | Example |
|---|---|---|---|---|
| idRef | String | Reference to a pattern or keyword resource | Yes | "financial-keywords" |
Example:
<Match idRef="payment-keywords"/>
Any Element
The Any element groups several child matches and succeeds when at least minMatches of them match.
| Attribute | Type | Description | Required | Default | Range |
|---|---|---|---|---|---|
| minMatches | Integer | Minimum number of child patterns that must match | No | 1 | 1 to number of children |
Example:
<Any minMatches="2">
  <Match idRef="keyword-group-1"/>
  <Match idRef="keyword-group-2"/>
  <Match idRef="keyword-group-3"/>
</Any>
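The minMatches semantics described above can be sketched in a few lines of Python. This is an illustration only: `any_satisfied` is a hypothetical helper, and each child Match is represented by a boolean saying whether it was found in the content.

```python
def any_satisfied(child_results, min_matches=1):
    """Return True when at least min_matches of the child Match results are True."""
    if not 1 <= min_matches <= len(child_results):
        raise ValueError("minMatches must be between 1 and the number of children")
    return sum(child_results) >= min_matches

# Three keyword groups; the booleans say which groups were found.
print(any_satisfied([True, False, True], min_matches=2))   # True
print(any_satisfied([True, False, False], min_matches=2))  # False
```

The range check mirrors the "1 to number of children" constraint from the attribute table.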
Confidence Levels
Confidence Calculation
A confidence level expresses how certain the system is that a detection is correct. Levels fall into four bands:
| Level | Range | Description | Use Case |
|---|---|---|---|
| Low | 1-40 | Weak indicators | Supporting evidence |
| Medium | 41-70 | Moderate confidence | Contextual matches |
| High | 71-85 | Strong confidence | Primary patterns |
| Very High | 86-100 | Definitive matches | Validated patterns |
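As a small illustration, the bands in the table above can be expressed as a lookup function. The function name `confidence_band` is hypothetical; the boundaries are taken directly from the table.

```python
def confidence_band(level):
    """Map a confidenceLevel value (1-100) to its band name from the table."""
    if not 1 <= level <= 100:
        raise ValueError("confidenceLevel must be between 1 and 100")
    if level <= 40:
        return "Low"
    if level <= 70:
        return "Medium"
    if level <= 85:
        return "High"
    return "Very High"

for level in (30, 65, 85, 95):
    print(level, confidence_band(level))
```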
Pattern Confidence
Each pattern within a rule has its own confidence level that contributes to the overall match confidence.
<Pattern confidenceLevel="85">
  <IdMatch idRef="validated-ssn-pattern"/>
  <Match idRef="ssn-keywords"/>
</Pattern>
<Pattern confidenceLevel="65">
  <IdMatch idRef="possible-ssn-pattern"/>
  <Any minMatches="2">
    <Match idRef="personal-keywords"/>
    <Match idRef="document-keywords"/>
    <Match idRef="form-keywords"/>
  </Any>
</Pattern>
Matching Elements Reference
Built-in Pattern Types
Regular Expression Patterns
<Regex id="ssn-pattern">
  <Pattern>\b\d{3}-?\d{2}-?\d{4}\b</Pattern>
</Regex>
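Before deploying a regular expression, it can help to try it against sample strings. The sketch below runs the SSN pattern above through Python's `re` module; it assumes the rule engine's regex dialect treats `\b`, `\d`, and `?` the same way Python does, which should be confirmed for the engine in use.

```python
import re

# The same expression as in the Regex element above.
SSN_PATTERN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

samples = [
    "SSN: 123-45-6789",   # dashed form
    "SSN: 123456789",     # undashed form
    "Ref 123-456789",     # only one dash present
    "Order 1234567890",   # ten digits in a row
]
for text in samples:
    match = SSN_PATTERN.search(text)
    print(text, "->", match.group(0) if match else None)
```

Because each dash is optional on its own, the expression also accepts the mixed form with a single dash and any bare nine-digit number, which is one reason the rule examples pair it with supporting keywords. The ten-digit string does not match, since `\b` requires the digit run to end after nine digits.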
Keyword Groups
<Keyword id="financial-terms">
  <Group matchStyle="word">
    <Term>account</Term>
    <Term>balance</Term>
    <Term>payment</Term>
  </Group>
</Keyword>
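The effect of word-style matching can be approximated in Python as follows. This sketch assumes that matchStyle="word" means whole-word matching and that matching is case-insensitive; neither detail is stated in this reference, and `build_word_matcher` is a hypothetical helper.

```python
import re

def build_word_matcher(terms):
    """Compile a case-insensitive regex that matches any term as a whole word."""
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)

matcher = build_word_matcher(["account", "balance", "payment"])
print(bool(matcher.search("Your account balance is due")))  # True
print(bool(matcher.search("accountability report")))        # False
```

Whole-word matching keeps a term such as "account" from firing inside longer words like "accountability".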
Built-in Functions
| Function | Description | Validation |
|---|---|---|
| Func_credit_card_formatted | Credit card with formatting | Luhn algorithm |
| Func_credit_card_unformatted | Credit card without formatting | Luhn algorithm |
| Func_ssn_formatted | SSN with dashes | Format validation |
| Func_ssn_unformatted | SSN without formatting | Format validation |
Example:
<IdMatch idRef="Func_credit_card_formatted"/>
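The Luhn algorithm named in the table is a standard checksum for card numbers. The Python sketch below shows the checksum itself; it is not the built-in function's actual implementation, and `luhn_valid` is a hypothetical name.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters such as spaces and dashes are ignored.
    """
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (a widely used test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```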
Advanced Pattern Matching
Proximity Matching
The patternsProximity attribute on a rule sets how close, in characters, the elements of a pattern must be to each other.
<Entity id="credit-card-with-context" patternsProximity="300">
  <Pattern confidenceLevel="90">
    <IdMatch idRef="credit-card-pattern"/>
    <Match idRef="credit-card-keywords"/>
  </Pattern>
</Entity>
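One way to picture a proximity check is as a distance test between character spans. The Python sketch below assumes distance is measured between the nearest edges of the primary match and a supporting match; this reference does not specify how the distance is computed, so treat that as an assumption. `within_proximity` is a hypothetical helper, and spans are `(start, end)` character offsets.

```python
def within_proximity(primary_span, evidence_spans, proximity=300):
    """Return True if any evidence span lies within `proximity` characters
    of the primary match, measured between the nearest edges."""
    p_start, p_end = primary_span
    for e_start, e_end in evidence_spans:
        # Gap is 0 when the spans overlap, otherwise the distance between edges.
        gap = max(e_start - p_end, p_start - e_end, 0)
        if gap <= proximity:
            return True
    return False

card_span = (100, 119)  # where the card number was found
print(within_proximity(card_span, [(10, 14)], proximity=300))    # True, gap is 86
print(within_proximity(card_span, [(500, 504)], proximity=300))  # False, gap is 381
```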
Conditional Logic
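Complex conditions can be built by nesting elements. In the pattern below, Any with minMatches="1" acts as a logical OR: either keyword group is enough to support the account-number match.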
<Pattern confidenceLevel="75">
  <IdMatch idRef="account-number"/>
  <Any minMatches="1">
    <Match idRef="bank-keywords"/>
    <Match idRef="routing-keywords"/>
  </Any>
</Pattern>
Exclusion Patterns
A Not element excludes content that matches its children, which reduces false positives such as test data.
<Pattern confidenceLevel="80">
  <IdMatch idRef="ssn-pattern"/>
  <Match idRef="personal-context"/>
  <Not>
    <Match idRef="test-data-keywords"/>
  </Not>
</Pattern>
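The way IdMatch, Match, Any, and Not combine can be sketched as a small recursive evaluator in Python. This is an illustration under stated assumptions, not the engine's implementation: it assumes every direct child of a Pattern must hold, it ignores proximity, and it takes the set of idRefs already found in the content as input. The function name `evaluate` is hypothetical.

```python
import xml.etree.ElementTree as ET

def evaluate(node, found):
    """Evaluate a Pattern subtree against the set of idRefs found in the content."""
    if node.tag in ("IdMatch", "Match"):
        return node.get("idRef") in found
    children = [evaluate(child, found) for child in node]
    if node.tag == "Any":
        return sum(children) >= int(node.get("minMatches", "1"))
    if node.tag == "Not":
        return not any(children)
    # Pattern: assume every child condition must hold.
    return all(children)

PATTERN = ET.fromstring("""
<Pattern confidenceLevel="80">
  <IdMatch idRef="ssn-pattern"/>
  <Match idRef="personal-context"/>
  <Not>
    <Match idRef="test-data-keywords"/>
  </Not>
</Pattern>
""")

print(evaluate(PATTERN, {"ssn-pattern", "personal-context"}))  # True
print(evaluate(PATTERN, {"ssn-pattern", "personal-context", "test-data-keywords"}))  # False
```

The second call fails because the test-data keywords are present, so the Not element rejects the match.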
Rule Examples
Credit Card Detection Rule
<Entity id="credit-card-number" patternsProximity="300" recommendedConfidence="85">
  <Pattern confidenceLevel="95">
    <IdMatch idRef="Func_credit_card_formatted"/>
    <Any minMatches="1">
      <Match idRef="credit-card-keywords"/>
      <Match idRef="payment-keywords"/>
    </Any>
  </Pattern>
  <Pattern confidenceLevel="85">
    <IdMatch idRef="Func_credit_card_unformatted"/>
    <Any minMatches="2">
      <Match idRef="credit-card-keywords"/>
      <Match idRef="payment-keywords"/>
      <Match idRef="financial-keywords"/>
    </Any>
  </Pattern>
  <Pattern confidenceLevel="70">
    <IdMatch idRef="credit-card-regex"/>
    <Any minMatches="3">
      <Match idRef="visa-keywords"/>
      <Match idRef="mastercard-keywords"/>
      <Match idRef="amex-keywords"/>
      <Match idRef="payment-context"/>
    </Any>
  </Pattern>
</Entity>
Social Security Number Rule
<Entity id="ssn-detection" patternsProximity="200" recommendedConfidence="80">
  <Pattern confidenceLevel="90">
    <IdMatch idRef="Func_ssn_formatted"/>
    <Match idRef="ssn-keywords"/>
  </Pattern>
  <Pattern confidenceLevel="75">
    <IdMatch idRef="Func_ssn_unformatted"/>
    <Any minMatches="2">
      <Match idRef="ssn-keywords"/>
      <Match idRef="personal-keywords"/>
      <Match idRef="government-keywords"/>
    </Any>
  </Pattern>
</Entity>
Bank Account Number Rule
<Entity id="bank-account-number" patternsProximity="250" recommendedConfidence="75">
  <Pattern confidenceLevel="85">
    <IdMatch idRef="bank-account-regex"/>
    <Any minMatches="1">
      <Match idRef="routing-keywords"/>
      <Match idRef="bank-keywords"/>
    </Any>
  </Pattern>
  <Pattern confidenceLevel="70">
    <IdMatch idRef="account-number-pattern"/>
    <Any minMatches="2">
      <Match idRef="banking-keywords"/>
      <Match idRef="financial-keywords"/>
      <Match idRef="account-keywords"/>
    </Any>
  </Pattern>
</Entity>
Performance Optimization
Pattern Ordering
Order patterns by confidence level, highest first, so the most reliable patterns come before lower-confidence fallbacks.
<Entity id="optimized-rule">
  <!-- Highest confidence pattern first -->
  <Pattern confidenceLevel="95">
    <IdMatch idRef="high-confidence-pattern"/>
  </Pattern>
  <!-- Lower confidence patterns follow -->
  <Pattern confidenceLevel="75">
    <IdMatch idRef="medium-confidence-pattern"/>
    <Match idRef="supporting-evidence"/>
  </Pattern>
</Entity>
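The ordering guideline can be checked automatically. The Python sketch below reads a rule with `xml.etree.ElementTree` and reports whether its patterns appear in descending confidence order; `patterns_in_descending_order` is a hypothetical helper, and the rule is assumed to be a standalone fragment without namespaces.

```python
import xml.etree.ElementTree as ET

def patterns_in_descending_order(rule_xml):
    """Return True if the rule's Pattern elements are ordered highest confidence first."""
    rule = ET.fromstring(rule_xml)
    levels = [int(p.get("confidenceLevel")) for p in rule.findall("Pattern")]
    return levels == sorted(levels, reverse=True)

ORDERED = """
<Entity id="optimized-rule">
  <Pattern confidenceLevel="95"><IdMatch idRef="high-confidence-pattern"/></Pattern>
  <Pattern confidenceLevel="75"><IdMatch idRef="medium-confidence-pattern"/></Pattern>
</Entity>
"""
UNORDERED = """
<Entity id="unordered-rule">
  <Pattern confidenceLevel="75"><IdMatch idRef="medium-confidence-pattern"/></Pattern>
  <Pattern confidenceLevel="95"><IdMatch idRef="high-confidence-pattern"/></Pattern>
</Entity>
"""
print(patterns_in_descending_order(ORDERED))    # True
print(patterns_in_descending_order(UNORDERED))  # False
```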
Proximity Settings
Use appropriate proximity values based on content type:
| Content Type | Recommended Proximity | Reason |
|---|---|---|
| Structured Forms | 100-200 characters | Fields are close together |
| Documents | 300-500 characters | Context may be spread out |
| Email Content | 200-400 characters | Mixed structured/unstructured |
| Database Exports | 50-150 characters | Highly structured data |
Keyword Optimization
- Use specific keywords over generic terms
- Group related keywords together
- Limit keyword lists to essential terms
- Use word matching (matchStyle="word") for better precision
Validation and Testing
Rule Validation
- Syntax Validation: Ensure XML is well-formed
- Reference Validation: Verify all idRef attributes point to valid resources
- Logic Validation: Check that the pattern logic is consistent (for example, minMatches does not exceed the number of children)
- Performance Testing: Test with representative content
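The reference validation step above lends itself to automation. The following Python sketch collects every `id` defined in a rule pack and reports any `idRef` that points at nothing. It rests on several assumptions: the `Rules` wrapper element is hypothetical, any element carrying an `id` is treated as a valid target, and built-in function names are passed in separately because they are not defined in the XML.

```python
import xml.etree.ElementTree as ET

def unresolved_references(rule_pack_xml, builtin_ids=()):
    """Return a sorted list of idRef values that match no defined id or built-in."""
    root = ET.fromstring(rule_pack_xml)  # also fails fast if the XML is not well-formed
    defined = {el.get("id") for el in root.iter() if el.get("id") is not None}
    defined.update(builtin_ids)
    referenced = {el.get("idRef") for el in root.iter() if el.get("idRef") is not None}
    return sorted(referenced - defined)

RULE_PACK = """
<Rules>
  <Entity id="ssn-detection" patternsProximity="200" recommendedConfidence="80">
    <Pattern confidenceLevel="90">
      <IdMatch idRef="Func_ssn_formatted"/>
      <Match idRef="ssn-keywords"/>
    </Pattern>
  </Entity>
  <Keyword id="personal-keywords">
    <Group matchStyle="word">
      <Term>name</Term>
    </Group>
  </Keyword>
</Rules>
"""
missing = unresolved_references(RULE_PACK, builtin_ids=("Func_ssn_formatted",))
print(missing)  # ['ssn-keywords']
```

Here the rule refers to `ssn-keywords`, which the pack never defines, so the check flags it.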
Testing Methodology
- Positive Testing: Verify rules match intended content
- Negative Testing: Ensure rules don't match unintended content
- Boundary Testing: Test edge cases and limits
- Performance Testing: Measure processing time and resource usage
Best Practices
Rule Design
- Start Simple: Begin with basic patterns, add complexity gradually
- Use Multiple Patterns: Provide different confidence levels
- Include Context: Use supporting keywords for better accuracy
- Test Thoroughly: Validate with real-world content samples
Performance
- Optimize Proximity: Use smallest effective proximity values
- Order Patterns: Place highest confidence patterns first
- Limit Complexity: Avoid overly complex logical conditions
- Monitor Performance: Track rule execution times
Maintenance
- Version Control: Track rule changes over time
- Regular Review: Periodically assess rule effectiveness
- Update Keywords: Keep keyword lists current
- Performance Monitoring: Watch for degradation over time